Remove redundant _filter_ingested from streaming pipeline #271

Merged: gvanrossum merged 1 commit into microsoft:main, May 5, 2026

## Conversation
The framework no longer does per-batch dedup queries inside `add_messages_streaming`. Callers are responsible for filtering duplicates before yielding messages into the stream.

- `ingest_email.py` already pre-filters via `is_source_ingested`
- `ingest_vtt.py` enforces a fresh DB (refuses existing)
- `podcast_ingest.py` uses unique source_ids by construction

This eliminates ~N unnecessary `are_sources_ingested` DB round-trips (one per batch) that always returned empty sets.

Closes microsoft#269
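The caller-side filtering described above can be sketched as follows. This is a minimal illustration, not the real typeagent API: `already_ingested` and `message_stream` are hypothetical names standing in for an `is_source_ingested`-style lookup and a caller's generator.

```python
# Sketch: the caller filters duplicates *before* yielding messages into
# the streaming pipeline, so the framework never needs per-batch dedup
# queries. All names here are illustrative, not the real API.

from typing import Iterable, Iterator


def already_ingested(source_id: str, ingested: set[str]) -> bool:
    """Stand-in for an is_source_ingested-style lookup."""
    return source_id in ingested


def message_stream(
    candidates: Iterable[tuple[str, str]],  # (source_id, body) pairs
    ingested: set[str],
) -> Iterator[tuple[str, str]]:
    """Yield only messages whose source_id is not already in the DB."""
    for source_id, body in candidates:
        if already_ingested(source_id, ingested):
            continue  # duplicates are dropped here, not in the framework
        yield (source_id, body)


# Usage: "a" was already ingested, so only "b" reaches the pipeline.
batch = [("a", "hello"), ("b", "world")]
new_messages = list(message_stream(batch, ingested={"a"}))
print(new_messages)  # [('b', 'world')]
```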
KRRT7 added a commit to KRRT7/typeagent-py that referenced this pull request on May 5, 2026:
The framework no longer populates messages_skipped (removed in microsoft#271), so the skipped counter and conditional output are dead code.
gvanrossum approved these changes on May 5, 2026.
bmerkle pushed a commit that referenced this pull request on May 5, 2026:
…or (#268)

## Summary

**Upfront bulk dedup** (commit 1):

- Replaces N per-file `is_source_ingested` DB queries with a single upfront `are_sources_ingested` bulk query + in-memory `set` lookup
- `_email_generator` now takes a pre-collected `email_entries` list and `already_ingested: set[str]`, skipping known files before parsing (avoids wasted parse cost)
- A single `_iter_emails` call feeds both the bulk pre-filter and the generator
- Removed dead `messages_skipped` tracking from the batch callback (always 0 since #271)

**Cleanup**:

- Removes `_flush_skipped`, skip-tracking variables, and the unused `IStorageProvider` import
- Net -49 lines

## Stack

**This PR is stacked on #267. Merge order: #271 → #267 → #268**

| # | PR | Branch | Description |
|---|---|---|---|
| 1 | #271 | `fix/remove-filter-ingested` | Remove framework dedup |
| 2 | #267 | `feat/vtt-streaming-ingestion` | VTT streaming migration |
| 3 | **#268** (this) | `refactor/email-dedup-consolidation` | Email bulk dedup |

## Reproducible test

First ingest (populates DB):

```
mkdir -p /tmp/test_eml
# create 2 .eml files (any valid RFC 5322 .eml will do)
uv run python tools/ingest_email.py /tmp/test_eml/ -d /tmp/dedup_test.db -v
```

Re-ingest (should skip all):

```
uv run python tools/ingest_email.py /tmp/test_eml/ -d /tmp/dedup_test.db -v
```

Expected output on re-ingest:

```
Pre-filter: 2 of 2 already ingested
Parsing and importing emails...
Successfully ingested 0 email(s)
Ingested 0 chunk(s)
Skipped 2 already-ingested email(s)
Extracted 0 semantic references
Total time: 0.0s
```

Key behavior: the bulk query fires once before the generator runs. No emails are parsed on re-ingest (0.0s total time).

## Test plan

- [x] `make format check test` passes (696 tests, 0 pyright errors)
- [x] Manual test: re-ingest the same .eml files, verify all are skipped via upfront bulk dedup
- [x] Manual test: ingest with `--verbose`, verify skip reporting in the summary

---------

Co-authored-by: Guido van Rossum <gvanrossum@gmail.com>
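The "upfront bulk dedup" pattern above can be illustrated with a small self-contained sketch. `FakeDB`, `email_generator`, and the function names below are hypothetical stand-ins for the real storage provider and `_email_generator`; the point is the shape of the refactor: one bulk round-trip plus an in-memory `set` lookup instead of N per-file queries.

```python
# Sketch of the upfront bulk dedup refactor: replace N per-file DB
# queries with ONE bulk query and a set lookup inside the generator.
# FakeDB and these names are illustrative, not the real API.

class FakeDB:
    def __init__(self, ingested: set[str]):
        self._ingested = ingested
        self.query_count = 0  # counts round-trips, to show there is one

    def are_sources_ingested(self, source_ids: list[str]) -> set[str]:
        """Single round-trip returning the subset already ingested."""
        self.query_count += 1
        return self._ingested & set(source_ids)


def email_generator(entries: list[str], already_ingested: set[str]):
    """Skip known files *before* the (expensive) parse step."""
    for path in entries:
        if path in already_ingested:
            continue  # no parse cost for duplicates
        yield path  # a real generator would parse and yield the message


db = FakeDB(ingested={"a.eml"})
entries = ["a.eml", "b.eml", "c.eml"]
already = db.are_sources_ingested(entries)  # the one bulk query
fresh = list(email_generator(entries, already))
print(fresh, db.query_count)  # ['b.eml', 'c.eml'] 1
```

The design point is that the membership test moves from a per-file DB query (O(N) round-trips) to an O(1) lookup against a set fetched once up front.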
## Summary

- Removed the `_filter_ingested` method and all per-batch `are_sources_ingested` DB queries from `add_messages_streaming`
- `ingest_email.py` already pre-filters via `is_source_ingested` before yielding
- `ingest_vtt.py` enforces a fresh DB (refuses to run if it exists)
- `ingest_podcast.py` uses unique source_ids by construction
- Removed the `batch_skipped` counter from `ingest_email.py`

Why: on a 10k-email re-ingest with batch_size=100, the framework was issuing ~100 `are_sources_ingested` queries that always returned empty sets (because the caller already filtered). Pure waste.

## Stack

This is the base of a 3-PR stack. Merge order: #271 → #267 → #268

1. #271 — `fix/remove-filter-ingested`
2. #267 — `feat/vtt-streaming-ingestion`
3. #268 — `refactor/email-dedup-consolidation`

## Test plan

- [x] `make check` passes (pyright 0 errors)
- [x] `pytest tests/` passes (696 tests — 5 removed that tested framework-level dedup)

Closes #269